Data Visualization of the Customer Table¶

This Jupyter Notebook provides a guided analysis of our Customer and Product datasets. This notebook converts analytical outputs into interactive visualizations using Pandas and Plotly. The goal is to provide business-ready charts that answer product, customer, and sales questions (trends, top performers, segmentation, etc.).

Requirements & Setup¶

Install required packages:

In [62]:
pip install pandas plotly
Requirement already satisfied: pandas in c:\users\belki\anaconda3\envs\myenv\lib\site-packages (2.2.3)
Requirement already satisfied: plotly in c:\users\belki\anaconda3\envs\myenv\lib\site-packages (5.24.1)
Requirement already satisfied: numpy>=1.26.0 in c:\users\belki\anaconda3\envs\myenv\lib\site-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\belki\anaconda3\envs\myenv\lib\site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\belki\anaconda3\envs\myenv\lib\site-packages (from pandas) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in c:\users\belki\anaconda3\envs\myenv\lib\site-packages (from pandas) (2024.2)
Requirement already satisfied: tenacity>=6.2.0 in c:\users\belki\anaconda3\envs\myenv\lib\site-packages (from plotly) (9.0.0)
Requirement already satisfied: packaging in c:\users\belki\anaconda3\envs\myenv\lib\site-packages (from plotly) (24.1)
Requirement already satisfied: six>=1.5 in c:\users\belki\anaconda3\envs\myenv\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
[notice] A new release of pip is available: 25.0.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip

Import common libraries:

In [63]:
import pandas as pd 
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
In [2]:
import plotly.io as pio
pio.renderers.default = "notebook_connected"

1. Input Data¶

Firstly, we input the trasnformed and generated Tables from the Gold Views

In [65]:
customer = pd.read_csv('./gold.report_customer.csv', sep=';')
product = pd.read_csv('./gold.report_products.csv', sep=';')

df = pd.read_csv('./gold.report_customer_products_merged.csv', sep=';')
subset = df[["age", "category", "age_group", "total_sales", "last_order_date", "total_quantity", "total_orders"]]

2. Data Preparation¶

Before creating visualizations, we performed some column renaming to improve readability and consistency:

Customer Table Contains demographic and behavioral information for customers:

Column Description
Customer Key Unique identifier for each customer
Customer Number Customer ID number
Customer Name Name of the customer
Age Age of the customer
Age Group Age category
Customer Segment Segment classification (e.g., VIP, Regular)
Last Order Date Date of the most recent purchase
Recency (Days) Days since last purchase
Total Orders Total number of orders placed
Total Sales Total spending of the customer
Total Quantity Total products purchased
Lifespan (Days) Number of days between first and last purchase
Avg Order Value Average value of orders placed
Avg Monthly Spend Average spend per month
In [66]:
## Rename the Columns to more friendly names
cc = {
'customer_key':'Customer Key',
'customer_number' : 'Customer Number',
'customer_name' : 'Customer Name',
'age'	: 'Age',
'age_group' : 'Age Group',
'customer_segment' : 'Customer Segment',
'last_order_date' : 'Last Order Date',
'recency' : 'Recency (Days)',
'total_orders' : 'Total Orders',
'total_sales' : 'Total Sales',
'total_quantity' : 'Total Quantity',
'lifespan' : 'Lifespan (Days)',
'avg_order_value' : 'Avg Order Value',
'avg_monthly_spend' : 'Avg Monthly Spend',
}

customer.rename(columns=cc, inplace=True)
customer
Out[66]:
Customer Key Customer Number Customer Name Age Age Group Customer Segment Last Order Date Recency (Days) Total Orders Total Sales Total Quantity Lifespan (Days) Avg Order Value Avg Monthly Spend
0 1 AW00011000 Jon Yang 54.0 50 and above VIP 03.05.2013 148 3 8249 8 8 2749 294
1 2 AW00011001 Eugene Huang 49.0 40-49 VIP 10.12.2013 141 3 6384 11 10 2128 182
2 3 AW00011002 Ruben Torres 54.0 50 and above VIP 23.02.2013 151 3 8114 4 4 2704 324
3 4 AW00011003 Christy Zhu 52.0 50 and above VIP 10.05.2013 148 3 8139 9 9 2713 280
4 5 AW00011004 Elizabeth Johnson 46.0 40-49 VIP 01.05.2013 148 3 8196 6 6 2732 292
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18476 18479 AW00029478 Darren Carlson 61.0 50 and above New 26.06.2013 147 1 2398 3 3 2398 2398
18477 18480 AW00029479 Tommy Tang 56.0 50 and above New 04.09.2012 156 1 2049 1 1 2049 2049
18478 18481 AW00029480 Nina Raji 48.0 40-49 New 17.07.2013 146 1 2442 5 5 2442 2442
18479 18482 AW00029481 Ivan Suri 60.0 50 and above New 13.08.2011 169 1 3375 1 1 3375 3375
18480 18483 AW00029482 Clayton Zhang 61.0 50 and above New 18.09.2012 156 1 2049 1 1 2049 2049

18481 rows × 14 columns


Product Table Contains information about products including sales, cost, and lifecycle:

Column Description
Product Key Unique identifier for each product
Product Name Name of the product
Category Main category of the product
Subcategory Subcategory of the product
Cost Cost of the product
Last Sale Date Date of the most recent sale
Recency (Months) Months since the last sale
Product Segment Segment classification of the product
Lifespan (Months) Duration the product has been sold
Total Orders Total orders for the product
Total Sales Total revenue generated by the product
Total Customers Number of unique customers who purchased the product
Avg Selling Price Average price at which the product sold
Avg Order Revenue Average revenue per order
Avg Monthly Revenue Average revenue per month
In [67]:
pp = {
'product_key':'Product Key',
'product_name' : 'Product Name',
'category' : 'Category',
'subcategory'	: 'Subcategory',
'cost' : 'Cost',
'last_sale_date' : 'Last Sale Date',
'recency_in_months' : 'Recency (Months)',
'product_segment' : 'Product Segment',
'tlifespan' : 'Lifespan (Months)',
'total_orders' : 'Total Orders',
'total_sales' : 'Total Sales',
'total_customers' : 'Total Customers',
'avg_selling_price' : 'Avg Selling Price',
'avg_order_revenue' : 'Avg Order Revenue',
'avg_monthly_revenue' : 'Avg Monthly Revenue',
}

product.rename(columns=pp, inplace=True)
product
Out[67]:
Product Key Product Name Category Subcategory Cost Last Sale Date Recency (Months) Product Segment lifespan Total Orders Total Sales total_quantity Total Customers Avg Selling Price Avg Order Revenue Avg Monthly Revenue
0 3 Mountain-100 Black- 38 Bikes Mountain Bikes 1898 27.12.2011 165 High-Performer 11 49 165375 49 49 3375 3375 15034
1 4 Mountain-100 Black- 42 Bikes Mountain Bikes 1898 27.12.2011 165 High-Performer 11 45 151875 45 45 3375 3375 13806
2 5 Mountain-100 Black- 44 Bikes Mountain Bikes 1898 21.12.2011 165 High-Performer 11 60 202500 60 60 3375 3375 18409
3 6 Mountain-100 Black- 48 Bikes Mountain Bikes 1898 26.12.2011 165 High-Performer 12 57 192375 57 57 3375 3375 16031
4 7 Mountain-100 Silver- 38 Bikes Mountain Bikes 1912 22.12.2011 165 High-Performer 12 58 197200 58 58 3400 3400 16433
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
125 291 Touring Tire Tube Accessories Tires and Tubes 2 28.01.2014 140 Low-Performer 13 1487 7435 1487 1410 5 5 571
126 292 Classic Vest- L Clothing Vests 24 28.01.2014 140 Mid-Range 12 195 12480 195 195 64 64 1040
127 293 Classic Vest- M Clothing Vests 24 27.01.2014 140 Mid-Range 12 199 12736 199 199 64 64 1061
128 294 Classic Vest- S Clothing Vests 24 28.01.2014 140 Mid-Range 13 168 10944 171 168 64 65 841
129 295 Water Bottle - 30 oz. Accessories Bottles and Cages 2 28.01.2014 140 Mid-Range 13 4244 21245 4249 4073 5 5 1634

130 rows × 16 columns

In [68]:
subset = subset.rename(columns={
    "age": "Age",
    "category": "Category",
    "age_group": "Age Group",
    "total_sales": "Total Sales",
    "last_order_date": "Last Order Date",
    "total_quantity": "Total Quantity",
    "total_orders": "Total Orders"
})
subset["Last Order Date"] = pd.to_datetime(subset["Last Order Date"], format="%d.%m.%Y")
subset = subset.sort_values("Last Order Date", ascending=True)
subset
Out[68]:
Age Category Age Group Total Sales Last Order Date Total Quantity Total Orders
7927 55.0 Bikes 50 and above 3400 2010-12-29 1 1
14324 55.0 Bikes 50 and above 3578 2011-01-02 1 1
30827 47.0 Bikes 40-49 3578 2011-01-03 1 1
30046 75.0 Bikes 50 and above 3400 2011-01-06 1 1
30885 62.0 Bikes 50 and above 3578 2011-01-06 1 1
... ... ... ... ... ... ... ...
22685 48.0 Accessories 40-49 22 2014-01-28 1 1
20335 50.0 Accessories 50 and above 77 2014-01-28 4 2
15566 52.0 Accessories 50 and above 10 2014-01-28 2 2
27499 70.0 Accessories 50 and above 27 2014-01-28 2 2
15601 65.0 Accessories 50 and above 13 2014-01-28 2 1

31093 rows × 7 columns

In [69]:
cum = pd.read_csv('./gold.report_customer.cumulative.csv', sep=';')
cum.rename(columns={
    "order_date": "Order Date",
    "total_sales": "Total Sales",
    "runnning_total_sales": "Cumulative Sales"
}, inplace=True)

Customer Data¶

Total Sales over Last Order Date (trend of revenue).

Purpose: Show a time series of total sales (revenue) across the dataset’s last-order dates to reveal trends, seasonality, and anomalies. This chart answers questions such as: “Is revenue growing month-over-month?”, “When are the seasonal peaks?”, and “Are there unexpected dips that need investigation?”

In [ ]:
sales_by_date["Last Order Date"] = pd.to_datetime(customer["Last Order Date"], format="%d.%m.%Y")
sales_by_date = sales_by_date.sort_values("Last Order Date", ascending=True)

fig = px.line(
    sales_by_date,
    x="Last Order Date",
    y="Total Sales",
    title="Total Sales Over Time",
    labels={"Last Order Date": "Date", "Total Sales": "Sales"}, 
    template="plotly_white", 
)
fig.update_layout(title_text='Total Sales Over Time', title_x=0.5)
fig.show()

Cumulative Sales over Time

Show the running (cumulative) total of sales over time so one can easily compare pace-of-sales across calendar periods (commonly year-over-year). This visual answers questions like “Which year accumulated more revenue by this date?” and “When did sales accelerate or slow down?”

How to read it

  • The area under the curve at any x-date is the total sales accumulated up to that date.
  • Steeper slopes indicate higher sales velocity during that period.
  • If multiple yearly series are present (e.g., one trace per year), compare vertical separation at the same calendar date to see which year is ahead.
  • Crossovers indicate that one year overtook another in cumulative sales at that point.
In [71]:
fig = px.area(
    cum,
    x="Order Date",
    y="Cumulative Sales",
    title="Cumulative Sales Growth Per Year",
    labels={"Last Order Date": "Date", "Cumulative Sales": "Cumulative Sales"},
    template="plotly_white",
)
fig.update_layout(title_text='Cumulative Sales Growth Per Year', title_x=0.5)
fig.show()

Scatter line with moving average: Recency or order volume trends by month

Plot monthly order volume and overlay a 3-month moving average to reveal short-term trends and smooth month-to-month noise. This helps answer: Are monthly orders increasing or decreasing? Are there short-term trends or seasonal patterns?

How to read the chart

  • Scatter points = actual total orders recorded each month.
  • Red line (3-month MA) = smoothed trend showing the average behavior across the current month and the two preceding months.
  • Rising MA slope = sustained increase in monthly orders.
  • Falling MA slope = sustained decline.
  • MA vs points: if points sit mostly above the MA, recent months are higher than the recent average; if below, they are lower.
In [72]:
import pandas as pd
import plotly.express as px

# Ensure Last Order Date is datetime
customer["Last Order Date"] = pd.to_datetime(customer["Last Order Date"], format="%d.%m.%Y")

# Extract month (as Year-Month period)
customer["Month"] = customer["Last Order Date"].dt.to_period("M")

# Aggregate total orders per month
monthly_orders = (
    customer.groupby("Month")["Total Orders"]
    .sum()
    .reset_index()
)

# Convert Month back to datetime for plotting
monthly_orders["Month"] = monthly_orders["Month"].dt.to_timestamp()
monthly_orders["Moving_Avg_Orders"] = monthly_orders["Total Orders"].rolling(window=3).mean()
fig = px.scatter(
    monthly_orders,
    x="Month",
    y="Total Orders",
    title="Monthly Total Orders with 3-Month Moving Average",
)

# Add moving average as a line
fig.add_scatter(
    x=monthly_orders["Month"],
    y=monthly_orders["Moving_Avg_Orders"],
    mode="lines",
    name="3-Month Moving Average",
    line=dict(color="red", width=3)
)


fig.update_xaxes(rangeslider_visible=True)

fig.show()

Performance Analysis

This bubble chart visualizes customer-level performance by plotting Total Orders on the x-axis against Total Sales on the y-axis. Bubble area (size) encodes Total Quantity purchased, and color groups customers by Customer Segment. It’s designed to answer questions such as:

  • Which customers generate high revenue relative to order volume?
  • Which segments contain high-frequency / high-volume buyers?
  • Are there outliers (very large buyers or unusually high order counts with low revenue)?

How to read the chart

  • X (Total Orders): horizontal position — higher → more orders placed.
  • Y (Total Sales): vertical position — higher → more revenue contributed.
  • Bubble size (Total Quantity): larger bubbles indicate greater total units bought; size_max=60 caps visual size.
  • Color (Customer Segment): quick visual clustering by segment.
  • Quadrants interpretation:
    • Top-right: many orders and high sales (high-value, frequent buyers).
    • Top-left: few orders but high sales (large basket size — low frequency, high value).
    • Bottom-right: many orders but low sales (frequent small purchases).
    • Bottom-left: low frequency and low revenue (low-priority customers).
In [73]:
px.scatter(
    customer,
    x="Total Orders",
    y="Total Sales",
    size="Total Quantity",
    color="Customer Segment",
    title="Total Orders vs Total Sales by Customer Segment",
    labels={"Total Orders": "Total Orders", "Total Sales": "Total Sales", "Total Quantity": "Total Quantity"},
    template="plotly_white",
    size_max=60  # max bubble size
)

Part to Whole - Share of Total Sales by Customer Segment

Visualize how total revenue is distributed across customer segments. This part-to-whole pie chart answers: Which customer segments contribute the largest share of revenue? and Is revenue concentrated in a few segments or spread evenly?

What the chart shows / how to read it

  • Each slice represents a customer segment; slice area (and angle) corresponds to that segment’s share of aggregated Total Sales.
  • The legend (or slice labels) shows segment names and — depending on your Plotly settings — hover will show absolute sales and percent-of-total.
  • Large slices indicate segments that contribute disproportionately to revenue; many small slices indicate a more even distribution.
In [74]:
px.pie(
    customer,
    names="Customer Segment",
    values="Total Sales",
    title="Share of Total Sales by Customer Segment",
    template="plotly_white"
)

Distribution of Total Sales by Age Group (Box Plot)

This box plot visualizes the distribution of total sales within each customer age group. It allows us to compare central tendencies (medians), spread (interquartile range), and outliers across different age cohorts.

How to read the chart

  • Box boundaries: show the interquartile range (IQR = 25th–75th percentile of sales values).
  • Line inside the box: median sales value for the group.
  • Whiskers: extend to the typical minimum and maximum values (within 1.5 × IQR).
  • Points outside whiskers: outliers (customers with unusually high or low sales relative to peers).
  • Color: corresponds to the Age Group category for easier distinction.
In [75]:
px.box(
    customer,
    x="Age Group",
    y="Total Sales",
    color="Age Group",
    title="Distribution of Total Sales by Age Group",
    labels={"Age Group": "Age Group", "Total Sales": "Total Sales"},
    template="plotly_white"
)

Distribution of Total Sales by Customer Segment and Age Group (Violin Plot)

This violin plot displays the distribution of total sales across customer segments, while simultaneously comparing differences within each age group. It combines the benefits of a box plot (medians, quartiles) with a kernel density estimate to show the shape of the data.

How to read the chart

  • Width of the violin: relative density of data points (wider = more customers with that sales value).
  • Vertical spread (y-axis): range of Total Sales values within each customer segment.
  • Color: distinguishes age groups inside each segment, allowing cross-segment comparisons.
In [76]:
px.violin(
    customer,
    x="Customer Segment",
    y="Total Sales",
    color="Age Group",
    title="Distribution of Total Sales by Customer Segment and Age Group",
    labels={"Age Group": "Age Group", "Total Sales": "Total Sales"},
    template="plotly_white")

Recency vs Lifespan by Customer Segment (Scatter Plot)

Purpose: This scatter plot explores the relationship between recency (how recently a customer made their last purchase) and lifespan (the duration from their first to last recorded purchase). It adds context with customer segment (color) and total orders (bubble size).

How to read the chart

  • X-axis (Recency): lower values = more recent purchases; higher values = longer since last order.
  • Y-axis (Lifespan): higher values = longer active customer history.
  • Bubble size: total order volume — larger bubbles = higher purchasing frequency.
  • Color: distinguishes customer segments for comparative analysis.
In [77]:
px.scatter(customer, 
           x="Recency (Days)",
              y="Lifespan (Days)",
                color="Customer Segment",
                size="Total Orders",
                title="Recency vs Lifespan by Customer Segment",
                labels={"Recency (Days)": "Recency (Days)", "Lifespan (Days)": "Lifespan (Days)", "Total Orders": "Total Orders"},
                template="plotly_white")

Product Data¶

Trend Analysis

Total Sales Trend Analysis (Line Chart)

This line chart tracks total product sales over time, showing day-level fluctuations and long-term revenue trends. It helps answer: Are sales increasing, declining, or seasonal over the observed period?

How to read the chart

  • X-axis (Date): timeline of sales activity.
  • Y-axis (Sales): total sales amount per day.
  • Line: connects daily values to highlight fluctuations and overall trends.
  • Upward slope: increasing sales momentum.
  • Downward slope: declining sales.
  • Spikes or dips: unusual events (campaigns, seasonality, data anomalies).
In [78]:
product["Last Sale Date"] = pd.to_datetime(product["Last Sale Date"], format="%d.%m.%Y")
bydate = product.groupby("Last Sale Date")["Total Sales"].sum().reset_index()

fig = px.line(
    bydate,
    x="Last Sale Date",
    y="Total Sales",
    title="Total Sales Over Time",
    labels={"Last Sale Date": "Date", "Total Sales": "Sales"}, 
    template="plotly_white", 
)
fig.update_layout(title_text='Total Sales Over Time', title_x=0.5)
fig.show()

Monthly Cumulative Sales (Area Chart with Yearly Reset)

This chart visualizes cumulative sales growth within each year, resetting the total at the start of every calendar year. It helps compare how sales performance evolves month by month across different years.

How to read the chart

  • X-axis (Month): timeline across years, with each year’s series plotted separately.
  • Y-axis (CumSalesYr): cumulative sales within a given year.
  • Colored areas: each color represents one calendar year.
  • Steeper slope: stronger sales momentum during that period.
  • Flatter slope: slower growth or plateauing sales.
In [79]:
monthly = product.copy()
monthly["Month"] = monthly["Last Sale Date"].dt.to_period("M").dt.to_timestamp()
monthly["Year"] = monthly["Month"].dt.year
monthly = monthly.groupby("Month")["Total Sales"].sum().reset_index()
monthly["Year"] = monthly["Month"].dt.year
monthly = monthly.sort_values(["Year","Month"])
monthly["CumSalesYr"] = monthly.groupby("Year")["Total Sales"].cumsum()

fig = px.area(monthly, x="Month", y="CumSalesYr", color=monthly["Month"].dt.year.astype(str),
              title="Monthly cumulative sales (yearly reset)")
fig.show()

Performance Analysis

Performance Analysis — Average Spend by Age Group and Category (Grouped Bar Chart)

This grouped bar chart highlights how different age groups spend on various product categories on average. It enables comparison both across categories and within age groups.

How to read the chart

  • X-axis (Category): product categories.
  • Y-axis (Average Spend): mean sales value per group.
  • Bar color: represents age groups.
  • Grouped bars: each category displays side-by-side bars for each age group, allowing direct comparison.
  • Taller bar: higher average spend in that category for that age group.
In [80]:
# aggregate: average spend per Age Group & Category
avg_spend = (
    subset.groupby(["Age Group", "Category"], as_index=False)
          ["Total Sales"].mean()
)

fig = px.bar(
    avg_spend,
    x="Category",
    y="Total Sales",       # now y is numeric (average spend)
    color="Age Group",
    barmode="group",       # side-by-side bars
    title="Average Spend by Age Group",
    labels={
        "Total Sales": "Average Spend",
        "Age Group": "Age Group",
        "Category": "Category"
    }
)

fig.update_traces(marker_line_width=1, marker_line_color="black")  # add outlines
fig.update_layout(bargap=0.2)  # add spacing between groups

fig.show()

Performance Analysis — Orders vs. Sales by Age Group (Bubble Chart)

This bubble chart compares customer purchase behavior across age groups, plotting orders, sales, and quantity simultaneously. It helps identify which age groups contribute most in terms of volume (orders), value (sales), and scale (quantity).

How to read the chart

  • X-axis (Total Orders): volume of orders placed.
  • Y-axis (Total Sales): revenue generated.
  • Bubble size: larger bubbles indicate higher total quantity purchased.
  • Bubble color: distinguishes customer age groups.
  • Position & size together: shows how age groups differ in both value and volume contribution.
In [81]:
px.scatter(
    subset,
    x="Total Orders",
    y="Total Sales",
    size="Total Quantity",
    color="Age Group",
    title="Total Orders vs Total Sales by Customer Segment",
    labels={"Total Orders": "Total Orders", "Total Sales": "Total Sales", "Total Quantity": "Total Quantity"},
    template="plotly_white",
    size_max=60  # max bubble size
)

Performance Analysis — Share of Total Sales by Age Group (Pie Chart)

This pie chart highlights how much each age group contributes to overall sales. It provides a part-to-whole view of customer demographics and their revenue share.

How to read the chart

  • Slices: each slice represents one age group.
  • Slice size: proportional to total sales contributed by that group.
  • Colors: differentiate age groups for easy visual comparison.
  • Larger slice: higher sales contribution relative to other groups.
In [82]:
px.pie(
    subset,
    names="Age Group",
    values="Total Sales",
    title="Share of Total Sales by Age Group",
    template="plotly_white"
)